Data loss prevention (DLP) refers to technology that prevents the unauthorized disclosure of sensitive information. Examples of sensitive information that can be protected by DLP include names, addresses, telephone numbers, social security numbers, credit card numbers, bank account numbers, and medical records.
One conventional DLP system starts by extracting text from a document. The conventional DLP system may apply optical character recognition (OCR) to convert the characters (i.e., letters, digits, symbols, etc.) in the document into plain text more accurately. Next, the conventional DLP system parses the extracted text into words by narrowing the extracted text to a particular vocabulary (e.g., English, Russian, Hebrew, etc.). Finally, the conventional DLP system performs exact and fuzzy matching to match the words against restricted words and/or restricted patterns. If a particular word matches a restricted word or a restricted pattern, the particular word or the entire document is prevented from being disclosed, thus safeguarding the sensitive information.
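The parsing and matching steps described above can be illustrated with a minimal Python sketch. It is not taken from any particular DLP product: the restricted word list, the social security number pattern, the 0.8 similarity threshold, and the function names `tokenize` and `scan` are all assumptions made for illustration, and the text is assumed to have been extracted (for example by OCR) beforehand.

```python
import re
from difflib import SequenceMatcher

# Hypothetical restricted vocabulary and pattern, chosen only for illustration.
RESTRICTED_WORDS = {"confidential", "passport"}
RESTRICTED_PATTERNS = [re.compile(r"\d{3}-\d{2}-\d{4}")]  # SSN-like shape
FUZZY_THRESHOLD = 0.8  # assumed similarity cutoff for fuzzy matching


def tokenize(text):
    """Split already-extracted text into word-like tokens."""
    return re.findall(r"[A-Za-z0-9][A-Za-z0-9'-]*", text)


def scan(text):
    """Return (token, kind) pairs, where kind is 'exact', 'pattern', or 'fuzzy'."""
    findings = []
    for token in tokenize(text):
        lowered = token.lower()
        if lowered in RESTRICTED_WORDS:
            findings.append((token, "exact"))
        elif any(p.fullmatch(token) for p in RESTRICTED_PATTERNS):
            findings.append((token, "pattern"))
        elif any(
            SequenceMatcher(None, lowered, word).ratio() >= FUZZY_THRESHOLD
            for word in RESTRICTED_WORDS
        ):
            findings.append((token, "fuzzy"))
    return findings


if __name__ == "__main__":
    sample = "Send the confidentail file and SSN 123-45-6789 today"
    hits = scan(sample)
    print(hits)
    # A simple policy: withhold the whole document if anything matched.
    print("block document:", bool(hits))
```

In this sketch the misspelled word "confidentail" is caught only by the fuzzy comparison, while the number is caught by the pattern. A system of the kind described could then withhold either the matching words or the entire document.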